Doubly Robust Policy Evaluation and Optimization

Similar resources

Doubly Robust Policy Evaluation and Optimization

We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of observed contexts and the chosen action by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation and Internet advertising. A central task is evaluation of a new policy gi...

More Robust Doubly Robust Off-policy Evaluation

We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of a policy from the data generated by another policy(ies). In particular, we focus on the doubly robust (DR) estimators that consist of an importance sampling (IS) component and a performance model, and utilize the low (or zero) bias of IS and low variance of the mo...

Doubly Robust Policy Evaluation and Learning

We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions an...
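
As a rough illustration of the evaluation task described above, the following is a minimal sketch of a doubly robust value estimate for logged contextual bandit data. It assumes the standard form that combines a reward-model prediction with an importance-weighted correction on the logged action. The function name `dr_value` and all argument names are hypothetical and are not taken from the paper.

```python
def dr_value(actions, rewards, logging_probs, target_probs, reward_model):
    """Doubly robust estimate of a target policy's value (illustrative sketch).

    actions:       logged action index for each round
    rewards:       observed reward for each logged action
    logging_probs: probability the logging policy gave to the logged action
    target_probs:  per-round list of target-policy probabilities, one per action
    reward_model:  per-round list of predicted rewards, one per action
    All names here are assumptions made for this example.
    """
    total = 0.0
    for a, r, p_log, pi, q in zip(actions, rewards, logging_probs,
                                  target_probs, reward_model):
        # Model-based term: expected predicted reward under the target policy.
        direct = sum(pi_b * q_b for pi_b, q_b in zip(pi, q))
        # Importance weight for the action that was actually logged.
        weight = pi[a] / p_log
        # Correct the model using the observed reward on the logged action.
        total += direct + weight * (r - q[a])
    return total / len(actions)


# Two logged rounds, two actions, uniform logging policy,
# target policy that always plays action 0.
logged = dict(
    actions=[0, 1],
    rewards=[1.0, 0.0],
    logging_probs=[0.5, 0.5],
    target_probs=[[1.0, 0.0], [1.0, 0.0]],
)
with_zero_model = dr_value(reward_model=[[0.0, 0.0], [0.0, 0.0]], **logged)
with_exact_model = dr_value(reward_model=[[1.0, 0.0], [1.0, 0.0]], **logged)
```

In this toy data both calls return 1.0: with an all-zero reward model the importance-weighted correction carries the estimate, and with a model that matches the observed rewards the correction vanishes.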

Doubly Robust Off-policy Evaluation for Reinforcement Learning

We study the problem of evaluating a policy that is different from the one that generates data. Such a problem, known as off-policy evaluation in reinforcement learning (RL), is encountered whenever one wants to estimate the value of a new solution, based on historical data, before actually deploying it in the real system, which is a critical step of applying RL in most real-world applications....

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

Proof. For the base case $t = H + 1$, since $V_{DR}^{0} = V(s_{H+1}) = 0$, it is obvious that at the $(H + 1)$-th step the estimator is unbiased with 0 variance, and the theorem holds. For the inductive step, suppose the theorem holds for step $t + 1$. At time step $t$, we have: $\mathbb{V}_t\big[V_{DR}^{H+1-t}\big] = \mathbb{E}_t\big[(V_{DR}^{H+1-t})^2\big] - \big(\mathbb{E}_t[V(s_t)]\big)^2 = \mathbb{E}_t\big[\big(\hat{V}(s_t) + \rho_t\big(r_t + \gamma V_{DR}^{H-t} - \hat{Q}(s_t, a_t)\big)\big)^2 - V(s_t)^2\big] + \mathbb{V}_t\big[V(s_t)\big]$ = E...
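
The recursion that this proof inducts over can be sketched in a few lines. The sketch below assumes the step-wise form visible in the expression above: the estimate with one more step to go equals a model value plus an importance-weighted correction, starting from 0 at step H + 1. The names `dr_trajectory_value`, `v_hat` and `q_hat` are hypothetical, and reading the inner value and action-value terms as model estimates is an assumption about the extracted text.

```python
def dr_trajectory_value(steps, gamma=1.0):
    """Backward doubly robust recursion over one trajectory (illustrative sketch).

    steps: list of (rho, reward, v_hat, q_hat) tuples in time order, where
           rho   is the per-step importance ratio,
           v_hat is an assumed model estimate of the state value,
           q_hat is an assumed model estimate of the action value.
    """
    v_dr = 0.0  # base case: the estimate is 0 after the final step
    for rho, reward, v_hat, q_hat in reversed(steps):
        # One step of the recursion: model value plus weighted correction.
        v_dr = v_hat + rho * (reward + gamma * v_dr - q_hat)
    return v_dr


# With a zero model and unit importance ratios this reduces to the
# discounted return: 1 + 0.5 * 2 = 2.0
discounted_return = dr_trajectory_value(
    [(1.0, 1.0, 0.0, 0.0), (1.0, 2.0, 0.0, 0.0)], gamma=0.5
)
```

When every importance ratio is 0, the recursion returns the model value of the first state unchanged, which shows how the model term anchors the estimate.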

Journal

Journal title: Statistical Science

Year: 2014

ISSN: 0883-4237

DOI: 10.1214/14-sts500